Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added line cache #678

Closed
wants to merge 2 commits into from
Closed

added line cache #678

wants to merge 2 commits into from

Conversation

nitinNT
Copy link

@nitinNT nitinNT commented Oct 30, 2022

No description provided.

@5nord 5nord marked this pull request as ready for review October 30, 2022 09:28
@5nord 5nord marked this pull request as draft October 30, 2022 09:28
Comment on lines 288 to 293
if t.cachedLineBegin <= pos && pos < t.cachedLineEnd {
return Position{
Line: t.cacheLine + 1,
Column: pos - t.lines[t.cacheLine] + 1,
}
}
Copy link
Member

@5nord 5nord Oct 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache-handling is distributed over two methods (tree.position and tree.searchLines). I suggest you put all cache-related code into only one function (either in tree.position or in tree.searchLines, but not both). This increases cohesion and make the code easier to reason about.

Comment on lines 306 to 307
t.cachedLineBegin = i
t.cachedLineEnd = j
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those two lines are the reason why the benchmarks are not as good as expected and why the tests fail:
i and j represent lines, while cachedLineBegin and cachedLineEnd are supposed to represent (file) offsets. In line 288 you're essentially comparing lines with offsets:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd

I think those two variables should be set after you found the proper line:

Suggested change
t.cachedLineBegin = i
t.cachedLineEnd = j

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have now put the comparison steps in searchLines function as below so searchLines function will return cacheLine

	if t.cachedLineBegin <= i && j < t.cachedLineEnd {
		return t.cacheLine
	}

is this correct ?

Copy link
Member

@5nord 5nord Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost. Instead of i and j you should compare with pos:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd {
        return t.cacheLine
}

I'd like to explain a bit how location handling works, this will clear things up:

Overview

One core functionality of this packages is to provide start and end position for syntax nodes. We call this a text span and it includes a file name, line numbers, column numbers and the file offsets. We could store this information inside the node object. For example:

type Node struct {
        // ...
        Span struct {
                Filename string
                Begin, End struct {
                        Line, Column, Offset int
                }
        }
}

This approach is strait-forward and easy, but also quite wasteful. It would add 64 bytes of repetitive data to every node and sometimes we need this detailed information only for a few nodes.

So, how can we improve this situation?

Instead of storing the whole span, we store the begin and end offsets (8 bytes instead of 64) and provide an algorithm to compute the detailed information only when needed. For example the node would look like this:

type Node struct {
        // ...
        Begin, End int32
}

How do we get from offset to line numbers?

When creating the syntax tree, we keep track of newline characters (\n) and store the start-offset for every line in a slice.

Example: Lines slice

The input Foo\nBar Baz\nFoobar\n contains three lines, the first line starts is at offset 0, the second line starts at offset 4, the last line starts at offset 12. The resulting lines slice would like this:

lines := []int{0, 4, 12}

Example: Lookup

In which line is the word Baz? We know from the syntax nodes, that the word begins at offset 8.
Now we search in the lines slice to find out to which line the offset 8 belongs to:

for i := range lines {
        if 8 < lines[i] {
                return i-1
        }
}  
return len(lines)-1

We need to subtract 1 from the found index, because this loop exits at one line after the one we were actually searching for; in this case it breaks at the third line (i==2).

Binary Search

Our files have 300-1500 lines on average. As a consequence the described lookup becomes very inefficient, because it's just a linear search. We can do much better if we use a binary search algorithm instead.
Here are some articles I found explaining binary search:

The tree.searchLines implements such a binary search. The variables i and j are the lower- and the higher limits required for the binary search. That's why i starts at 0 and walks towards the end of the lines slice, while j starts at the end (len(t.lines)) and walks towards the beginning of the lines slice.

Caching

The caching is independent of those binary search variables. For caching we need to check if the pos offset is on the same line as the last time searchLines was called. We could do it this way:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd {
        return t.cacheLine
}

I just realized you don't need store the offsets in cachedLineBegin and cachedLineEnd, but you can do the check directly in the lines-slice:

if len(t.lines) > 1 && t.lines[t.cachedLine] <=  pos && pos < t.lines[t.cachedLine+1] {
        return t.cachedLine
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for detailed explanation.

@@ -306,6 +314,7 @@ func (t *tree) searchLines(pos int) int {
j = h
}
}
t.cacheLine = int(i) - 1
Copy link
Member

@5nord 5nord Oct 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.cacheLine = int(i) - 1
line := int(i) - 1
t.cachedLineBegin = t.lines[line]
t.cachedLineEnd = t.lines[line+1]
t.cacheLine = line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have tried this but for line 319 its giving me runtime error: index out of range [1] with length 1 as line array contains only one element.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. t.lines[lint+1] does not work for the last line.

Maybe we could fix it with a condition? No very elegant, I know:

if line < len(t.lines) {
    t.cachedLineEnd = t.lines[line+1]
} else {
    t.cachedLineEnd = len(t.lines)
}

@5nord 5nord closed this Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants